ggplot2 plots are built in a series of layers
Plot space is built
The variables are specified
The type of visualisation (known as a geom) that is desired for these variables is specified - in this case geom_point() is called to visualise individual data points;
A second geom is added to include a line of best fit, the axis labels are edited for readability
A theme is applied to change the overall appearance of the plot.
These layers allow a lot of customizability by allow tweaking of layers independently of other layers, and easy adapting of existing code.
Run this just once, as you only need to install the package a single time. (Well, except that you may need to update it occassionally)
Let’s load the data contained in tidyverse, “starwars” containing data about characters from Star Wars movies. The summary function is contained within base R.
Let’s inspect “mass”, the mass of each star was character
The psych package’s “desribe” function is also helpful. The “::” argument allows you to call a single function from a package without calling the full package and all of its contents.
You can also do it using tidyverse approach:
starwars %>%
group_by(gender) %>%
summarise(
mean = mean(mass, na.rm=T),
median = median(mass, na.rm=T),
sd = sd(mass, na.rm=T),
min = min(mass, na.rm=T),
max = max(mass, na.rm=T),
count = n()
)# A tibble: 3 × 7
gender mean median sd min max count
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 feminine 54.7 55 8.59 45 75 17
2 masculine 106. 80 185. 15 1358 66
3 <NA> 48 48 NA 48 48 4
Maybe we want to look at some data by gender so let’s “factorize” it:
[1] "masculine" "masculine" "masculine" "masculine" "feminine" "masculine"
[7] "feminine" "masculine" "masculine" "masculine"
starwars <- mutate(starwars, gender = factor(
x = gender, # column to translate
levels = c("masculine", "feminine"), # values of the original data in preferred order
labels = c("masculine", "feminine") # labels for display
))
starwars$gender[1:10] [1] masculine masculine masculine masculine feminine masculine feminine
[8] masculine masculine masculine
Levels: masculine feminine
Let’s take a look at the mass by gender:
Descriptive statistics by group
group: masculine
vars n mean sd median trimmed mad min max range skew kurtosis
X1 1 49 106.15 184.97 80 81.08 10.38 15 1358 1343 6.31 39.82
se
X1 26.42
------------------------------------------------------------
group: feminine
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 9 54.69 8.59 55 54.69 7.41 45 75 30 1.24 0.69 2.86
Woah! Males have a massive standard deviation… And it looks like the the max is huge too. This may be a good time to inspect the data. What would be a good way to plot this?
Let’s do the histogram step by step…
Layer 1: Plot frame
data specifies which data source to use for the plotLayer 2: Variables specified
mapping specifies which variables to map to which aesthetics (aes) of the plot. Mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms.
x specifies which variable to put on the x-axis
define an aesthetic mapping (using the aesthetic (aes) function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g., as x/y positions or characteristics such as size, shape, color, etc.
Layer 3: Type of visualization.
Woah! Big outlier
Layer 3 still: Customization
Layer 4: Theme
Layer 5: Facet wrap
Facet wrap can be specified using the tilde but can also be specified using vars():
Layer 5: Fill
You could also depict gender using fill.
Level 6: Labels
You can also customize the binwidth:
It’s Jabba the Hut! Let’s remove his weight (i.e., assign NA to it) and re-run the plots.
Males look a lot more sensible to interpret without the massive outlier
Using facet_wrap
Now we have an understanding of the distributions of mass for the different genders in Star Wars, maybe we want to plot their means. How would we do that?
Organize data for plotting
Layer 1: Initialize plot
Layer 2: Variables
Layer 3: Bar plot
Note: The default for geom_bar is to count the data provided and that you should not specify a y in aes().
Layer 4: Error bars
Customize: Fill bars with red, outline with blue, error bars colored pink, classic theme
Customize axes, including “breaks”
summedMass %>%
ggplot(aes(x=gender,y=mass)) +
geom_bar(fill="red",color="blue",stat="identity") +
geom_errorbar(color="pink",aes(ymin=mass+SD, ymax=mass-SD), width=.2) +
theme_classic() +
scale_x_discrete(name = "Gender",
labels = c("Males", "Females")) +
scale_y_continuous(name = "Average Weight",
breaks = c(0,50,100))However, bar plots can be a bit limited in how they convey information about summary statistics as they do not depict clearly the distribution and spread of scores.
First, films column contains a bunch of lists in each row. Let’s fix that.
[[1]]
[1] "The Empire Strikes Back" "Revenge of the Sith"
[3] "Return of the Jedi" "A New Hope"
[5] "The Force Awakens"
[[2]]
[1] "The Empire Strikes Back" "Attack of the Clones"
[3] "The Phantom Menace" "Revenge of the Sith"
[5] "Return of the Jedi" "A New Hope"
[[3]]
[1] "The Empire Strikes Back" "Attack of the Clones"
[3] "The Phantom Menace" "Revenge of the Sith"
[5] "Return of the Jedi" "A New Hope"
[7] "The Force Awakens"
[[4]]
[1] "The Empire Strikes Back" "Revenge of the Sith"
[3] "Return of the Jedi" "A New Hope"
[[5]]
[1] "The Empire Strikes Back" "Revenge of the Sith"
[3] "Return of the Jedi" "A New Hope"
[5] "The Force Awakens"
starwarsunnested <- starwars %>%
unnest(films) %>% # unnest list
drop_na() %>% # remove NAs
add_count(gender, films) %>% # count how many appearances of each gender within each film
filter(n > 1) # retain if gender appears more than once in film
starwarsunnested$films[1:5][1] "The Empire Strikes Back" "Revenge of the Sith"
[3] "Return of the Jedi" "A New Hope"
[5] "The Force Awakens"
Now, let’s plot the mass of different Star Wars characters by gender and by film, using box plots.
Violin plots also communicate the spread of data, by essentially flipping density plots sideways .
pos <- position_dodge(0.9)
starwarsunnested %>% drop_na() %>% ggplot(aes(x = gender, y = mass, fill=films)) +
geom_violin(position = pos) +
geom_boxplot(width = .2,
fatten = NULL,
position = pos) +
stat_summary(fun = "mean",
geom = "point",
position = pos) +
stat_summary(fun.data = "mean_se",
geom = "errorbar",
width = .1,
position = pos)These are arguably my favorite for summary statistics. I think they best capture the best of both worlds. Plotting the miles per gallon for cars based on whether they have an automatic or manual transmission.
mtcars %>% drop_na %>% ggplot(aes(x=as.factor(am),y=mpg)) +
ggdist::stat_halfeye(adjust = .5, width = .7, .width = 0, justification = -.2, point_colour = NA) +
geom_boxplot(width = .2, outlier.shape = NA) +
geom_jitter(width = .05, alpha = .3) +
labs(y="Miles per Gallon") +
scale_x_discrete(name = "Transmission",
labels = c("Automatic", "Manual"))The base R package “Nile” is a timeseries object which contains a particular type of data for timeseries analysis and plotting. We are going to convert it to a dataframe so we can use it for ggplot2.
Now, let’s plot it as timeseries data using a lineplot:
Maybe we want to customize it a bit. Let’s add points for each time point, and increase the thicknes and change the linetype. And let’s change the background to blue!
ggplot(NileDf, aes(x=Year, y=Nile)) +
geom_line(size=1.5, linetype="dashed") + theme_linedraw() + labs(title="Measurements of the annual flow of the river Nile at Aswan",y="Annual flow in 10^8 m^3",x="Year") +
geom_point(size=2) +
theme(panel.background = element_rect(fill = "lightblue",
colour = "lightblue",
size = 0.5, linetype = "solid")
)Okay, that looks terrible. But you get the picture.
Scatterplots can be useful for depicting the relationship between two continuous variables
First, let’s clean up mtcars a bit and convert transmission to a categorical factor and convert the rownames that contain car names to an actual column/variable titled “Cars”.
Let’s plot the relationship between horsepower and miles-per-gallon. All you need to do is use “geom_point”. You can customize this and do more things with it but this is the basis of a scatterplot.
This isn’t so clear those because the legend doesn’t depict the labels and the axes aren’t very explicit. How would we tweak the graph to fix this?
Perhaps you want to have the colors differ based on whether it is automatic or manual transmission.
Maybe we want to get real fancy, and add label for each car!
mtcars %>% ggplot(aes(x=mpg,y=hp,color=am)) +
scale_color_discrete(name="Transmission", labels=c('Automatic', 'Manual')) + scale_x_continuous(name="Miles per Gallon") + scale_y_continuous(name="Horsepower") +
geom_text(size=3, aes(label = Cars, size = NULL), nudge_y = 0.7, check_overlap=T) + geom_point()Finally, sometimes it can be useful to depict the plot with a best-fit line through the data points, which can be done using geom_smooth. This defaults to the loess method, which is nonlinear.
mtcars %>% ggplot(aes(x=mpg,y=hp,color=am)) +
scale_color_discrete(name="Transmission", labels=c('Automatic', 'Manual')) + scale_x_continuous(name="Miles per Gallon") + scale_y_continuous(name="Horsepower") +
geom_text(size=3, aes(label = Cars, size = NULL), nudge_y = 0.7, check_overlap=T) + geom_point() +
geom_smooth()If you want to see a best-fit ordinary-least-squares line through your data points, you can add method=lm:
mtcars %>% ggplot(aes(x=mpg,y=hp,color=am)) +
scale_color_discrete(name="Transmission", labels=c('Automatic', 'Manual')) + scale_x_continuous(name="Miles per Gallon") + scale_y_continuous(name="Horsepower") +
geom_text(size=3, aes(label = Cars, size = NULL), nudge_y = 0.7, check_overlap=T) + geom_point() +
geom_smooth(method = lm)You can also remove the confidence intervals of the line with “se=F”:
mtcars %>% ggplot(aes(x=mpg,y=hp,color=am)) +
scale_color_discrete(name="Transmission", labels=c('Automatic', 'Manual')) + scale_x_continuous(name="Miles per Gallon") + scale_y_continuous(name="Horsepower") +
geom_text(size=3, aes(label = Cars, size = NULL), nudge_y = 0.7, check_overlap=T) + geom_point() +
geom_smooth(method = lm, se=F)Let’s use a different dataset for this one so we can have more levels of a factor available. The European Social Survey has multilevel self-report data available for a variety of countries. There are certain codes used for non-responses though so let’s replace those. 77, 88, 99 are non-responses of some kind.
df <- data.table::fread("ESS10.csv")
# Set up data for plotting
df <- df %>%
mutate(fairelcc = replace(fairelcc, fairelcc > 76, NA), # In country national elections are free and fair
medcrgv = replace(medcrgv, medcrgv > 76, NA), # In country the media are free to criticise the government
gvctzpv = replace(gvctzpv, gvctzpv > 76, NA), # The government protects all citizens against poverty
wpestopc = replace(wpestopc, wpestopc > 76, NA) # In country the will of the people cannot be stopped
) %>%
mutate(gndr = as_factor(ifelse(gndr==9, NA, gndr)) ) Let’s plot the relationship between “media has right to criticize government” with “national elections are free and fair”. Let’s include different shapes for males and females, and let’s have the color of the lines and data points differ based on country.
Okay, that’s pretty cool. But what if we want to tweak the colors further?
Many color palettes only have 5 to 10 different colors, which only allows you to generate colors for 5-10 levels of a categorical factor. However, fret not! colorRamp allows you to interpolate colors betw
Only 5 colors, so let’s interpolate…
Concatenate plots
Stack plots
Concatenate and stack
Two concatenates and stacks:
I like patchwork a lot but as you can see, if you’re plotting multiple plots with the same legend, then it tries to fit all of them which can be redundant. I don’t believe there’s any option for using a common legend with patchwork. But there’s a solution!
Same logic
Common legend
| Value | Word |
|---|---|
| linetype = 0 | linetype = “blank” |
| linetype = 1 | linetype = “solid” |
| linetype = 2 | linetype = “dashed” |
| linetype = 3 | linetype = “dotted” |
| linetype = 4 | linetype = “dotdash” |
| linetype = 5 | linetype = “longdash” |
| linetype = 6 | linetype = “twodash” |
ggsave can be done in inches, centimeters, millimeters, pixels at a variety of heights, saved to different location. It can be saved as “png”, “eps”, “ps”, “tex” (pictex), “pdf”, “jpeg”, “tiff”, “png”, “bmp”, “svg”
Option #1
Could print plot, and it will default to saving last plot
Option #2
Second argument is plot and the first argument is filename and you can specify full directory in filename
Can also specify the path directory in path argument and filename in first argument
There is also base R saving which operates as follows. Print plot after calling png, jpg, etc. Can change dpi and dimensions. Then call dev.off to save file and return control to screen.